{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Non i.i.d. data\n",
    "\n",
    "In machine learning, it is quite common to assume that the data are i.i.d,\n",
    "meaning that the generative process does not have any memory of past samples\n",
    "to generate new samples.\n",
    "\n",
    "<div class=\"admonition note alert alert-info\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
    "<p class=\"last\">i.i.d is the acronym of \"independent and identically distributed\"\n",
    "(as in \"independent and identically distributed random variables\").</p>\n",
    "</div>\n",
    "\n",
    "This assumption is usually violated in time series, where each sample can be\n",
    "influenced by previous samples (both their feature and target values) in an\n",
    "inherently ordered sequence.\n",
    "\n",
    "In this notebook we demonstrate the issues that arise when using the\n",
    "cross-validation strategies we have presented so far, along with non-i.i.d.\n",
    "data. For such purpose we load financial\n",
    "quotations from some energy companies."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "symbols = {\n",
    "    \"TOT\": \"Total\",\n",
    "    \"XOM\": \"Exxon\",\n",
    "    \"CVX\": \"Chevron\",\n",
    "    \"COP\": \"ConocoPhillips\",\n",
    "    \"VLO\": \"Valero Energy\",\n",
    "}\n",
    "template_name = \"../datasets/financial-data/{}.csv\"\n",
    "\n",
    "quotes = {}\n",
    "for symbol in symbols:\n",
    "    data = pd.read_csv(\n",
    "        template_name.format(symbol), index_col=0, parse_dates=True\n",
    "    )\n",
    "    quotes[symbols[symbol]] = data[\"open\"]\n",
    "quotes = pd.DataFrame(quotes)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can start by plotting the different financial quotations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "\n",
    "quotes.plot()\n",
    "plt.ylabel(\"Quote value\")\n",
    "plt.legend(bbox_to_anchor=(1.05, 0.8), loc=\"upper left\")\n",
    "_ = plt.title(\"Stock values over time\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, we want to predict the quotation of Chevron using all other energy\n",
    "companies' quotes. To make explanatory plots, we first use a train-test split\n",
    "and then we evaluate other cross-validation methods."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data, target = quotes.drop(columns=[\"Chevron\"]), quotes[\"Chevron\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will use a decision tree regressor that we expect to overfit and thus not\n",
    "generalize to unseen data. We use a `ShuffleSplit` cross-validation to\n",
    "check the generalization performance of our model.\n",
    "\n",
    "Let's first define our model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.tree import DecisionTreeRegressor\n",
    "\n",
    "regressor = DecisionTreeRegressor()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And now the cross-validation strategy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import ShuffleSplit\n",
    "\n",
    "cv = ShuffleSplit(random_state=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We then perform the evaluation using the `ShuffleSplit` strategy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import cross_val_score\n",
    "\n",
    "test_score = cross_val_score(regressor, data, target, cv=cv, n_jobs=2)\n",
    "print(f\"The mean R2 is: {test_score.mean():.2f} \u00b1 {test_score.std():.2f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Surprisingly, we get outstanding generalization performance. We will\n",
    "investigate and find the reason for such good results with a model that is\n",
    "expected to fail. We previously mentioned that `ShuffleSplit` is a\n",
    "cross-validation method that iteratively shuffles and splits the data.\n",
    "\n",
    "We can simplify the\n",
    "procedure with a single split and plot the prediction. We can use\n",
    "`train_test_split` for this purpose."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "data_train, data_test, target_train, target_test = train_test_split(\n",
    "    data, target, shuffle=True, random_state=0\n",
    ")\n",
    "\n",
    "# Shuffling breaks the index order, but we still want it to be time-ordered\n",
    "data_train.sort_index(ascending=True, inplace=True)\n",
    "data_test.sort_index(ascending=True, inplace=True)\n",
    "target_train.sort_index(ascending=True, inplace=True)\n",
    "target_test.sort_index(ascending=True, inplace=True)\n",
    "\n",
    "regressor.fit(data_train, target_train)\n",
    "target_predicted = regressor.predict(data_test)\n",
    "# Recover the `DatetimeIndex` from `target_test` for correct plotting\n",
    "target_predicted = pd.Series(target_predicted, index=target_test.index)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's check the generalization performance of our model on this split."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import r2_score\n",
    "\n",
    "test_score = r2_score(target_test, target_predicted)\n",
    "print(f\"The R2 on this single split is: {test_score:.2f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Similarly, we obtain good results in terms of $R^2$. We now plot the\n",
    "training, testing and prediction samples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "target_train.plot(label=\"Training\")\n",
    "target_test.plot(label=\"Testing\")\n",
    "target_predicted.plot(label=\"Prediction\")\n",
    "\n",
    "plt.ylabel(\"Quote value\")\n",
    "plt.legend(bbox_to_anchor=(1.05, 0.8), loc=\"upper left\")\n",
    "_ = plt.title(\"Model predictions using a ShuffleSplit strategy\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From the plot above, we can see that the training and testing samples are\n",
    "alternating. This structure effectively evaluates the model\u2019s ability to\n",
    "interpolate between neighboring data points, rather than its true\n",
    "generalization ability. As a result, the model\u2019s predictions are close to the\n",
    "actual values, even if it has not learned anything meaningful from the data.\n",
    "This is a form of **data leakage**, where the model gains access to future\n",
    "information (testing data) while training, leading to an over-optimistic\n",
    "estimate of the generalization performance.\n",
    "\n",
    "An easy way to verify this is to not shuffle the data during\n",
    "the split. In this case, we will use the first 75% of the data to train and\n",
    "the remaining data to test. This way we preserve the time order of the data, and\n",
    "ensure training on past data and evaluating on future data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_train, data_test, target_train, target_test = train_test_split(\n",
    "    data,\n",
    "    target,\n",
    "    shuffle=False,\n",
    ")\n",
    "regressor.fit(data_train, target_train)\n",
    "target_predicted = regressor.predict(data_test)\n",
    "target_predicted = pd.Series(target_predicted, index=target_test.index)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_score = r2_score(target_test, target_predicted)\n",
    "print(f\"The R2 on this single split is: {test_score:.2f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this case, we see that our model is not magical anymore. Remember that a\n",
    "negative $R^2$ means that the regressor performs worse than always predicting\n",
    "the mean of the target. We can visually check what we are predicting as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "target_train.plot(label=\"Training\")\n",
    "target_test.plot(label=\"Testing\")\n",
    "target_predicted.plot(label=\"Prediction\")\n",
    "\n",
    "plt.ylabel(\"Quote value\")\n",
    "plt.legend(bbox_to_anchor=(1.05, 0.8), loc=\"upper left\")\n",
    "_ = plt.title(\"Model predictions using a split without shuffling\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that our model cannot predict anything because it doesn't have samples\n",
    "around the testing sample. Let's check how we could have made a proper\n",
    "cross-validation scheme to get a reasonable generalization performance\n",
    "estimate.\n",
    "\n",
    "One solution would be to group the samples into time blocks, e.g. by quarter,\n",
    "and predict each group's information by using information from the other\n",
    "groups. We can use the `LeaveOneGroupOut` cross-validation for this purpose."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import LeaveOneGroupOut\n",
    "\n",
    "groups = quotes.index.to_period(\"Q\")\n",
    "cv = LeaveOneGroupOut()\n",
    "test_score = cross_val_score(\n",
    "    regressor, data, target, cv=cv, groups=groups, n_jobs=2\n",
    ")\n",
    "print(f\"The mean R2 is: {test_score.mean():.2f} \u00b1 {test_score.std():.2f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this case, we see that we cannot make good predictions, which is less\n",
    "surprising than our original results.\n",
    "\n",
    "Another thing to consider is the actual application of our solution. If our\n",
    "model is aimed at forecasting (i.e., predicting future data from past data),\n",
    "we should not use training data that are ulterior to the testing data. In this\n",
    "case, we can use the `TimeSeriesSplit` cross-validation to enforce this\n",
    "behaviour."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import TimeSeriesSplit\n",
    "\n",
    "cv = TimeSeriesSplit(n_splits=groups.nunique())\n",
    "test_score = cross_val_score(regressor, data, target, cv=cv, n_jobs=2)\n",
    "print(f\"The mean R2 is: {test_score.mean():.2f} \u00b1 {test_score.std():.2f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In conclusion, it is really important not to carelessly use a\n",
    "cross-validation strategy which do not respect some assumptions such as having\n",
    "i.i.d data. It might lead to misleading outcomes, creating the false\n",
    "impression that a predictive model performs well when it may not be the case\n",
    "in the intended real-world scenario.\n",
    "\n",
    "scikit-learn offers useful tools for time-series analysis apart from\n",
    "`TimeSeriesSplit`, (see for instance the [Time-related feature engineering\n",
    "example](https://scikit-learn.org/stable/auto_examples/applications/plot_cyclical_feature_engineering.html)\n",
    "in the documentation), and scikit-learn models can yield even better results\n",
    "when combined with other specialized libraries."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}